ABSTRACT

In this work, decision tree learning algorithms and fuzzy inference systems are
applied to galaxy morphology classification. In particular, the CART, the C4.5, the
Random Forest and fuzzy logic algorithms are studied and reliable classifiers are
developed to distinguish between spiral galaxies, elliptical galaxies and star/unknown
galactic objects. Morphology information for the training and testing datasets is
obtained from the Galaxy Zoo project while the corresponding photometric and spectra
parameters are downloaded from the SDSS DR7 catalogue.

Key words: SDSS, Galaxy Zoo, galaxy morphology classification, decision trees, 
fuzzy logic, machine learning 



1 INTRODUCTION 



The exponential rise in the amount of available astronomical
data has created, and will continue to create, a digital world in
which extracting new and useful information from archives is a
major endeavour in itself. For instance, the seventh data release
of the Sloan Digital Sky Survey (SDSS DR7) catalogues five-band
photometry for 357 million distinct objects covering an area of
11,663 deg^2, and over 1.6 million spectra for 930,000 galaxies,
120,000 quasars and 460,000 stars over 9380 deg^2 (Abazajian et al.
2009). Future wide-field imaging surveys will capture invaluable
images of hundreds of millions of objects, even those with very
faint magnitudes.

Most of our current knowledge on galaxy classification
is based on the pioneering work of several dedicated observers
who visually catalogued thousands of galaxies. For instance, in
Fukugita et al. (2007), 2253 objects from the SDSS DR3 were
classified into a Hubble Type catalogue by three people
independently. A final classification was then obtained from the
mean. Classifying all objects captured in the very large datasets
produced by digital sky surveys is beyond the capacity of a small
number of individuals and therefore calls for new approaches. The
challenge is to design intelligent algorithms that reproduce
classifications to the same degree of accuracy as those made by
human experts. Automated methods which make use of artificial
neural networks have already been proposed by Ball et al. (2009)
and more recently by Banerji et al. (2009). Andrae and Melchior
also propose the use of shapelet decomposition to model the
natural galaxy morphologies.

In this study, the advantages obtained by performing
decision tree learning and fuzzy inferencing for galaxy
morphology classification are investigated. In particular, results
from the CART, the C4.5 and the Random Forest tree building
algorithms are compared. The outputs obtained after testing a
fuzzy inference system generated through subtractive clustering
are also presented. Ball et al. (2009) adopted decision tree
approaches for star/galaxy classification. Calleja and Fuentes
also attempted to construct classifiers from parameters extracted
from processed images of galaxies. However, such work was based
on only a limited number of attributes and samples.

The aim here is to develop reliable models that distinguish
between spiral galaxies, elliptical galaxies and star/unknown
galactic objects using photometric as well as spectra data.
Classified training and testing samples are obtained from the
Galaxy Zoo project while the corresponding parameters are
downloaded from the SDSS DR7 catalogue. As discussed by Banerji
et al. (2009), this and similar studies present a unique
opportunity to compare human classifications to those obtained
through automated machine learning algorithms. Should the
automated techniques prove to be as successful as the human
techniques in separating astronomical objects into different
morphological classes, considerable time and effort will be saved
in future surveys whilst also ensuring uniformity in the
classifications.

In the following Section, the Galaxy Zoo catalogue is
described. Section 3 explores the various decision tree
algorithms while Section 4 introduces fuzzy inference systems.
In Section 5 the SDSS photometric and spectra parameters used
for classification are discussed. The results obtained and the
overall conclusion are then given in Section 6 and Section 7
respectively.



A. Gauci et al 



Figure 1. Galaxy Zoo sample counts

Figure 2. Decision tree for the concept of ShuttleLaunch



2 THE GALAXY ZOO DATASET 

Galaxy classification is a task that humans excel in. Galaxy
Zoo builds on this and offers a web service where volunteers can
log in and classify galaxies according to their morphological
class. The online portal (www.galaxyzoo.org) presents its users
with a sky image centred on a galaxy randomly selected from a
defined set. Such images are colour composites of the g, r and i
bands available in the SDSS. Users are asked to determine whether
the image shown contains a spiral galaxy, an elliptical galaxy, a
merger, a star or an unknown object. Users are also asked to
distinguish between clockwise, anticlockwise and edge-on spirals.
No distinction is made between barred and unbarred systems. This
in turn creates a directory specifying the morphological class
of each galaxy. The project has attracted a considerable number
of users. A weighting scheme that takes into account how similar
each user's classifications are to those of the rest of the users
is adopted. Full details are available in Lintott et al. (2008).



In this work, the full sample set of Bamford et al. (2009),
updated to include redshifts from SDSS DR7, was used. This is
based on the magnitude-limited SDSS Main Galaxy Sample (Strauss
et al. 2002). The dataset consists of 667,935 entries, each of
which corresponds to an object in the SDSS database (Set 1). For
training and testing, samples with a debiased probability greater
than 0.8 were considered (Set 2). In Banerji et al. (2009), the
same threshold was used on the raw morphological type
likelihoods. This selection identifies the samples that were
tagged with the same morphology by most users. After this
filtering, the dataset reduced to 251,867 samples from which
75,000 were randomly selected to test the algorithms (Set 3).
Using larger training and testing sets did not produce any gain
in accuracy. Table 1 gives the number of galaxies in each
morphological class for the defined sets whereas Figure 1
presents the same data graphically. The trends in all sets are
very similar. This indicates that each subset is representative
of the distribution of samples in the entire catalogue.



3 DECISION TREES 

The task of classifying galaxies is a typical classification
problem in which samples are to be categorised into groups
from a discrete set of possible categories. The main objective
is to create a model that is able to predict the class of a
sample from several input parameters. Decision tree
classification schemes sort samples by traversing the tree from
the root down to the corresponding leaf node. A tree is learnt by
splitting the examples into subsets based on an attribute value
test. Branches of the tree are built by recursively repeating
this process for each node. The recursion stops when all elements
in the subset at a node have the same value of the target
variable, or when splitting no longer adds value to the
predictions. Some tree learning algorithms also allow for
inequality tests and can work on noisy and incomplete datasets.
Decision tree learning can be successfully applied when the
samples can be described via a fixed number of parameters and
when the output forms a discretised set. For instance, Figure 2
presents the learnt concept of ShuttleLaunch, which suggests
whether it is suitable to proceed with a scheduled launch after
taking into account system checks and forecasted weather.
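The traversal from root to leaf described above can be illustrated with a minimal Python sketch. The tree, the attribute names (`concentration`, `colour_g_r`) and the thresholds below are hypothetical and chosen only for illustration; they are not taken from the trees learnt in this work.

```python
# Hypothetical tree for illustration only (not a tree learnt in this study).
# An internal node is (attribute, threshold, left_subtree, right_subtree);
# a leaf is simply a class label string.
TREE = (
    "concentration", 2.6,
    "spiral",
    ("colour_g_r", 0.7, "spiral", "elliptical"),
)


def classify(node, sample):
    """Walk from the root to a leaf, applying one inequality test per node."""
    while not isinstance(node, str):
        attribute, threshold, left, right = node
        # Go left when the attribute value passes the test, right otherwise.
        node = left if sample[attribute] <= threshold else right
    return node


sample = {"concentration": 3.1, "colour_g_r": 0.9}
label = classify(TREE, sample)
print(label)
```

Each sample ends in exactly one leaf, so the tree always outputs a single class.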

As with other supervised learning algorithms, a set of
training examples with known classification is initially
processed to infer a decision tree. Ideally, this is a good
description of the classification procedure and can then be used
to determine the class of other unseen samples. The learnt class
description can be understood by humans, so new knowledge, about
galaxy morphology classification in particular, can be obtained.

The approach adopted by most algorithms is very much
the same. Moving in a top down manner, a greedy search through
the space of possible trees is initiated to try and find the
minimum structure that correctly represents the examples. The
constructed tree is then used as a rule set for predicting the
class category of an unknown sample from the same set of
attributes.
Table 1. Number of samples in each morphological class in each set

Set Name   Number of Objects   No. of Ellipticals   No. of Spirals   No. of Stars/Unknown objects
Set 1      667,935             431,436              230,207          6,292
Set 2      251,867             163,206              87,242           1,419
Set 3      75,000              49,193               25,273           534















3.1 C4.5 

Following work done by Hunt in the late 1950s and early
1960s, Ross Quinlan continued to improve on the developed
techniques and released the Iterative Dichotomiser 3 (ID3)
and the improved C4.5 decision tree learners (Kohavi and
Quinlan 1990). Even though the C5 algorithm is commercially
available, the freely available C4.5 algorithm is considered
here.

Here trees are built by recursively searching through
and splitting the provided training set. If all samples in the
set belong to the same class, the tree is taken to be made of
just a leaf node. Otherwise, the values of the parameters are
tested to determine a non-trivial partition that separates the
samples into the corresponding classes. In C4.5, the selected
splitting criterion is the one that maximises the information
gain and the gain ratio.

Let RF(C_j, S) be the relative frequency of samples in
set S that belong to class C_j. The information that identifies
the class of a sample in set S is:

\[ I(S) = -\sum_{j} RF(C_j, S) \log\left( RF(C_j, S) \right) \]

After applying a test T that separates set S into S_1, S_2, ..., S_n,
the information gained is:

\[ G(S,T) = I(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|} I(S_i) \]

The test that maximises G(S,T) is selected at the respective
node. The main problem with this approach is that it favours
tests having a large number of outcomes, such as those producing
many subsets each with few samples. Hence the gain ratio, which
also takes into account the potential information from the
partition itself, is introduced. The potential information is:

\[ P(S,T) = -\sum_{i=1}^{n} \frac{|S_i|}{|S|} \log\left( \frac{|S_i|}{|S|} \right) \]

and the gain ratio is G(S,T) / P(S,T).
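To make the difference between the information gain and the gain ratio concrete, the following minimal Python sketch computes both for a toy set of class labels. It assumes a base-2 logarithm and the usual definition of the gain ratio as G(S,T) divided by the potential (split) information P(S,T); the function names and the toy labels are illustrative only.

```python
from collections import Counter
from math import log2


def information(labels):
    """I(S): entropy of the class labels in S (base-2 logarithm assumed)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_and_ratio(labels, subsets):
    """Return (G(S,T), G(S,T)/P(S,T)) for a test T that splits S into subsets."""
    total = len(labels)
    parts = [s for s in subsets if s]  # ignore empty outcomes of the test
    gain = information(labels) - sum(
        (len(s) / total) * information(s) for s in parts
    )
    split_info = -sum((len(s) / total) * log2(len(s) / total) for s in parts)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


S = ["E", "E", "E", "S", "S", "S"]

# A two-way test that separates the classes perfectly.
gain_two, ratio_two = gain_and_ratio(S, [["E", "E", "E"], ["S", "S", "S"]])

# A six-way test that puts every sample in its own subset.
gain_six, ratio_six = gain_and_ratio(S, [[label] for label in S])

print(gain_two, ratio_two)
print(gain_six, ratio_six)
```

Both tests reach the same information gain, but the six-way test has a much lower gain ratio, which is the bias correction the gain ratio is meant to provide.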



If all training samples are classified correctly, the tree may
be overfitting the data and will fail when attempting to classify
more general, unseen samples. Normally this is prevented
by restricting some examples from being considered when
building the tree or by pruning some of the branches after
the tree is inferred. C4.5 adopts the latter strategy and
removes some branches in a single bottom up pass.

One of the main advantages of the C4.5 algorithm is
that it is capable of dealing with real, non-nominal attributes
and so is compatible with continuous parameters.
It can also handle missing attribute data.

3.2 CART 

The Classification and Regression Tree (CART) scheme was
developed by Friedman and Breiman (Kohavi and Quinlan
1990). Although a divide and conquer search strategy similar
to that of C4.5 is used, the resulting tree structure, the
splitting criteria, the pruning method as well as the way
missing values are handled are redefined.

CART only allows for binary trees to be created. While
this may simplify splitting and optimally partitions
categorical attributes, there may be no good binary split for a
parameter and inferior trees might be inferred. However, for
multi-class problems, twoing may be used. This involves
separating all samples into two mutually exclusive super-classes
at each node and then applying the splitting criteria for a two
class problem.

As a splitting criterion, CART uses the Gini diversity
index. Let RF(C_j, S) again be the relative frequency of
samples in set S that belong to class C_j, then the Gini index
is defined as:

\[ I_{gini}(S) = 1 - \sum_{j} RF(C_j, S)^2 \]

and the information gain due to a particular test T can be
computed from:

\[ G(S,T) = I_{gini}(S) - \sum_{i} \frac{|S_i|}{|S|} I_{gini}(S_i) \]

As with the C4.5 algorithm, the split that maximises
G(S,T) is selected. If all samples in a given node belong to the
same class, then the samples are perfectly homogeneous and there
is no impurity.
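As an illustration of the Gini criterion, the short Python sketch below evaluates the index and the resulting gain for a binary split. It assumes the standard form of the Gini index (one minus the sum of squared relative class frequencies); the function names and toy labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini diversity index of a non-empty list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_gain(labels, left, right):
    """Reduction in Gini impurity for a binary split of labels into left/right.

    Both sides of the split are assumed to be non-empty.
    """
    total = len(labels)
    return (
        gini(labels)
        - (len(left) / total) * gini(left)
        - (len(right) / total) * gini(right)
    )


S = ["E", "E", "E", "S", "S", "S"]
pure_gain = gini_gain(S, ["E", "E", "E"], ["S", "S", "S"])
mixed_gain = gini_gain(S, ["E", "E", "S"], ["E", "S", "S"])
print(gini(S), pure_gain, mixed_gain)
```

A node whose samples all share one class has a Gini index of zero, so the split that isolates the classes gives the larger gain.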

The CART algorithm also prunes the tree and uses
cross validation methods that may require more computation
time. However, this yields shorter trees than those
obtained from C4.5. Samples with missing data may also be
processed.



3.3 Random Forests 



Breiman and Cutler (2001), the pioneers of random forests,
suggest using a classifier in which a number of decision trees
are built. When processing a particular sample, the output
of each of the individual trees is considered and the resulting
mode is taken as the final classification. Each tree is
grown from a different subset of examples, allowing for an
unseen (out of bag) set of samples to be used for evaluation.
Candidate attributes for each node are chosen randomly and the
one which produces the highest level of learning is selected. It
is shown that the overall accuracy is increased when the trees
are less correlated. Having each of the individual trees with
a low error rate is also desirable.

Apart from producing a highly accurate classifier, such
a scheme can also handle a very large number of samples and
input variables. A proximity matrix which shows how samples
are related is also generated. This is useful since such
relations may be very difficult to detect just by inspection.
With this strategy, good results may still be obtained
even when a large portion of the data is missing or when the
number of examples in each category is unbalanced.
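The two ingredients described above, growing each tree on a different subset and taking the mode of the tree outputs, can be sketched in a few lines of Python. The sketch assumes that subsets are drawn by sampling with replacement (bootstrap), and the three one-test "trees", their attribute names and thresholds are hypothetical stand-ins rather than learnt models.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement; unused indices are out of bag."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


def forest_predict(trees, sample):
    """Final class is the mode of the individual tree outputs."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]


# Three hypothetical single-test 'trees' standing in for learnt trees.
trees = [
    lambda s: "elliptical" if s["concentration"] > 2.6 else "spiral",
    lambda s: "elliptical" if s["colour_g_r"] > 0.7 else "spiral",
    lambda s: "elliptical" if s["axis_ratio"] > 0.5 else "spiral",
]

sample = {"concentration": 3.0, "colour_g_r": 0.8, "axis_ratio": 0.3}
prediction = forest_predict(trees, sample)  # two votes to one

rng = random.Random(0)
in_bag, out_of_bag = bootstrap_indices(10, rng)
print(prediction, in_bag, out_of_bag)
```

The out of bag indices are the samples a given tree never saw during training, which is what allows them to be used for evaluation.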



4 FUZZY INFERENCE SYSTEMS 

Fuzzy logic accommodates soft computing by allowing for an
imprecise representation of the real world. In crisp logic a
clear boundary is considered to separate the various classes
and each element is categorised into one group such that the
samples in sets A and not A together represent the entire
dataset. Fuzzy logic extends this by giving every sample a
degree of membership in each set and hence also caters for
situations in which simple boolean logic is not enough. If
classically set membership was denoted by 0 (false) or 1 (true),
now we can also have 0.25 or 0.75. In fuzzy logic, the truth of
any statement becomes a matter of degree.

The mathematical function that maps each input to the
corresponding membership value between 0 and 1 is known
as the membership function. Although this can be arbitrary,
such a function is normally chosen with computational efficiency
and simplicity in mind. Common membership functions include
the triangular function, the trapezoidal function, the Gaussian
function and the bell function. The latter two are the most
popular and although they are smooth, concise and can attain
non-zero values anywhere, they cannot specify asymmetric
membership functions. This limitation is alleviated through the
use of sigmoid functions that can open either left or right.
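The membership functions named above can be written down directly. The Python sketch below uses common textbook parameterisations, which are an assumption here since the text does not fix the exact forms; the parameter names (`a`, `b`, `c`, `d`, `centre`, `sigma`, `width`, `slope`) are illustrative.

```python
from math import exp


def triangular(x, a, b, c):
    """Rises from a to a peak of 1 at b, then falls to 0 at c (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


def trapezoidal(x, a, b, c, d):
    """Rises from a to b, stays at 1 until c, falls to 0 at d (a < b <= c < d)."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def gaussian(x, centre, sigma):
    """Smooth, symmetric and non-zero everywhere."""
    return exp(-0.5 * ((x - centre) / sigma) ** 2)


def bell(x, width, slope, centre):
    """Generalised bell function: symmetric with a flat top around centre."""
    return 1.0 / (1.0 + abs((x - centre) / width) ** (2 * slope))


def sigmoid(x, slope, centre):
    """Asymmetric: opens right for slope > 0 and left for slope < 0."""
    return 1.0 / (1.0 + exp(-slope * (x - centre)))


for f, args in [(triangular, (0.0, 0.5, 1.0)), (gaussian, (0.5, 0.2))]:
    print(f.__name__, [round(f(x / 4, *args), 3) for x in range(5)])
```

All five functions return values between 0 and 1, so any of them can serve as a degree of membership.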

For an inference system, if-then rules that deal with
fuzzy antecedents and fuzzy consequents are defined. An
aggregated fuzzy set is then outputted after these conditional
rules are compared and combined by the fuzzy equivalents of the
standard logical operators. Since the degree of membership can
now attain any value between 0 and 1, the AND and OR operators
are replaced by the min and max functions respectively.
The resulting output is then defuzzified to obtain one output
value.
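A minimal Python sketch of rule evaluation is given below. It uses the common convention that fuzzy AND is the minimum and fuzzy OR is the maximum of the membership degrees, and it assumes a simple weighted-average defuzzification, which is only one of several possible choices and is not necessarily the one used in this work. The two rules and the membership degrees are hypothetical.

```python
def fuzzy_and(*degrees):
    """Fuzzy AND: the minimum of the membership degrees."""
    return min(degrees)


def fuzzy_or(*degrees):
    """Fuzzy OR: the maximum of the membership degrees."""
    return max(degrees)


def fuzzy_not(degree):
    """Fuzzy NOT: the complement of the membership degree."""
    return 1.0 - degree


def defuzzify(rule_outputs):
    """Weighted average of (firing strength, crisp output) pairs."""
    total = sum(strength for strength, _ in rule_outputs)
    if total == 0:
        return None  # no rule fired
    return sum(strength * value for strength, value in rule_outputs) / total


# Hypothetical membership degrees for one object.
high_concentration = 0.8
red_colour = 0.6

# Rule 1: IF concentration is high AND colour is red THEN output 1.
rule1 = fuzzy_and(high_concentration, red_colour)
# Rule 2: IF concentration is NOT high OR colour is NOT red THEN output 0.
rule2 = fuzzy_or(fuzzy_not(high_concentration), fuzzy_not(red_colour))

score = defuzzify([(rule1, 1.0), (rule2, 0.0)])
print(rule1, rule2, score)
```

The single crisp value returned by the defuzzification step is what would then be thresholded or rounded to obtain a class.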

Fuzzy inference systems are easily understood and can
even be applied when dealing with imprecise data. Like
decision tree classifiers, they provide a transparent model that
experts can analyse and extend with further information.
Such inference approaches have already been successfully
applied in a number of applications that range from
integration in consumer products to industrial process control,
medical instrumentation and decision support systems.



5 INPUT PARAMETERS 

In all machine learning algorithms, the set of input
parameters strongly determines the overall accuracy of the
classifier. Ideally, a minimum number of attributes that can
differentiate between the three galaxy morphology classes
is required. For this work, photometric and spectra values
downloaded from the SDSS PhotoObjAll and SpecLineAll
tables were used. Data for objects whose classification is
available in the Galaxy Zoo catalogue were downloaded
and used to test the various machine learning algorithms.



Table 2. Set of input parameters that are band independent,
from the i band (~ 700nm - 1400nm) and from the r band
(~ 700nm)

Name                        Description

Band independent
dered_g - dered_r           deredded (g - r) colour
dered_r - dered_i           deredded (r - i) colour

i band
deVAB_i                     DeVaucouleurs fit axis ratio
expAB_i                     Exponential fit axis ratio
lnLExp_i                    Exponential disk fit log likelihood
lnLDeV_i                    DeVaucouleurs fit log likelihood
lnLStar_i                   Star log likelihood
petroR90_i / petroR50_i     Concentration
mRrCc_i                     Adaptive (+) shape measure
texture_i                   Texture parameter
mE1_i                       Adaptive E1 shape measure
mE2_i                       Adaptive E2 shape measure
mCr4_i                      Adaptive fourth moment

r band
deVAB_r                     DeVaucouleurs fit axis ratio
expAB_r                     Exponential fit axis ratio
lnLExp_r                    Exponential disk fit log likelihood
lnLDeV_r                    DeVaucouleurs fit log likelihood
lnLStar_r                   Star log likelihood
petroR90_r / petroR50_r     Concentration
mRrCc_r                     Adaptive (+) shape measure
texture_r                   Texture parameter
mE1_r                       Adaptive E1 shape measure
mE2_r                       Adaptive E2 shape measure
mCr4_r                      Adaptive fourth moment



5.1 Photometric Attributes 



In this study, the set of 13 parameters used by Banerji
et al. (2009), which are based on colour, profile fitting and
adaptive moments, was used. However, we did not limit the
evaluation to the i band but also tested whether the values
derived from the r band give equal or better classification
accuracies. The input parameters used are presented
in Table 2.

The DeVaucouleurs law provides a measure of how the
surface brightness of an elliptical galaxy varies with apparent
distance from the centre. This should provide a good
element of discrimination between spiral and elliptical
profiles. The lnLStar parameter also helps to separate galaxies
from stars. The concentration parameter is given by
the ratio of the radii containing 90% and 50% of the Petrosian
flux in a given band. The texture parameter compares the
range of fluctuations in the surface brightness of the object
to the full dynamic range of the surface brightness. It is
expected that this is negligible for smooth profiles but becomes
significant in high variance regions such as spiral arms.

The other parameters used are based on the object's
shape. In particular, the adaptive moments derived from the
SDSS photometric pipeline are second moments of the object
intensity, measured using a particular scheme designed
to have an optimal signal to noise ratio. These moments
are calculated by using a radial weight function that adapts
to the shape and size of the object. Although theoretically
there exists an optimal radial shape for the weight function
related to the light profile of the object, a Gaussian with size
matched to that of the object is used (Bernstein and Jarvis
2002).
The sum of the second moments in the CCD row and
column direction (mRrCc) is calculated by:

\[ mRrCc = \langle c^2 \rangle + \langle r^2 \rangle \]

where c and r correspond to the columns and rows of the
sensor respectively and the second moments are defined as, for
example:

\[ \langle c^2 \rangle = \frac{\sum [ I(r,c) \, w(r,c) \, c^2 ]}{\sum [ I(r,c) \, w(r,c) ]} \]

I is the intensity of the object and w is the weighting
function. The ellipticity/polarisation components are defined by:

\[ m_{e1} = \frac{\langle c^2 \rangle - \langle r^2 \rangle}{mRrCc}, \qquad m_{e2} = \frac{2 \langle c \, r \rangle}{mRrCc} \]

A fourth order moment is also defined as:

\[ m_{cr4} = \frac{\langle (c^2 + r^2)^2 \rangle}{\sigma^4} \]

In this case, \sigma is the size of the Gaussian weight function applied.

Table 3. Wavelengths of spectra lines
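The moment definitions can be illustrated with a small Python sketch that evaluates them for a handful of pixels. This is not the SDSS pipeline: the pixel list, the unit weights and the function name are hypothetical, row and column offsets are assumed to be measured from the object centroid, and the factor of 2 in mE2 follows the usual SDSS definition and is treated as an assumption here.

```python
def adaptive_moments(pixels, sigma):
    """Weighted moments from a list of (r, c, intensity, weight) tuples.

    r and c are row and column offsets from the object centroid and sigma
    is the size of the Gaussian weight function.
    """
    norm = sum(i * w for _, _, i, w in pixels)

    def mean(f):
        # Intensity- and weight-averaged value of f(r, c) over all pixels.
        return sum(i * w * f(r, c) for r, c, i, w in pixels) / norm

    cc = mean(lambda r, c: c * c)
    rr = mean(lambda r, c: r * r)
    cr = mean(lambda r, c: c * r)
    m_rr_cc = cc + rr
    return {
        "mRrCc": m_rr_cc,
        "mE1": (cc - rr) / m_rr_cc,
        "mE2": 2.0 * cr / m_rr_cc,
        "mCr4": mean(lambda r, c: (c * c + r * r) ** 2) / sigma ** 4,
    }


# A round toy object: four equal pixels at unit distance from the centre.
round_object = [(1, 0, 1.0, 1.0), (-1, 0, 1.0, 1.0),
                (0, 1, 1.0, 1.0), (0, -1, 1.0, 1.0)]
# A toy object elongated along the column direction.
elongated = [(0, 2, 1.0, 1.0), (0, -2, 1.0, 1.0),
             (1, 0, 1.0, 1.0), (-1, 0, 1.0, 1.0)]

round_moments = adaptive_moments(round_object, sigma=1.0)
elongated_moments = adaptive_moments(elongated, sigma=1.0)
print(round_moments)
print(elongated_moments)
```

The round object has zero ellipticity components, while the elongated one has a non-zero mE1, which is the behaviour that makes these quantities useful shape descriptors.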



5.2 Spectra Attributes 

To try and differentiate between elliptical, spiral and other
morphologically shaped galaxies, this study also makes use
of strong emission lines captured in galactic spectra.
Significant lines of oxygen and hydrogen around the 5000 Å and
7000 Å marks are expected for spiral galaxies. Since these
often have star forming regions in the arms, the presence of
sulphur, nitrogen and helium originating from ionized gas
clouds is also expected. Elliptical galaxies on the other hand
are believed to have no star forming activity and can
therefore be identified from continuous and average spectra.

Part of the SDSS pipeline is responsible for detecting and
storing all strong emission lines present in the captured
spectra. This is achieved through wavelet filters. An attempt is
made to match all peaks with one of the candidate emission lines
defined in a list. Each line is then fitted with a
single Gaussian by the SLATEC common mathematical library
routine SNLS1E (SDSS). The height and the dispersion
of the fitted Gaussian, the resulting chi-squared error
and other derived parameters are stored in the corresponding
database. Table 3 presents the wavelengths of lines
considered for this study. Lines storing missing dummy values
for more than 5% of the dataset were ignored. Although still
randomly selected, samples for the training and testing sets
were biased towards entries with a lower chi-squared error.
This allowed for height values of better fitted Gaussians to
be considered.



6 RESULTS 

Initially, the 13 photometric parameters derived from the i
band were standardised and independent component analysis
was performed to determine the most significant components.
As can be seen from the resulting eigenvalues shown
in Figure 3, all of the independent components attain a
non-zero value. This implies that all attributes are important for
galaxy classification and dimension reduction is unnecessary.
Figure 4 and Figure 5 show the eigenvalues obtained when
the same analysis was repeated on r band data as well as
on the 24 attributes obtained when the i and r band
parameters were combined.

Figure 3. Eigenvalues from the 13 i band parameters

Figure 5. Eigenvalues from the 24 i and r band parameters

Figure 6. Decision tree confusion matrices for the i band input
parameters
Panels: CART, C4.5 (0.25 confidence), C4.5 (0.1 confidence),
Random Forest (10 trees), Random Forest (50 trees)

Random Forest (50 trees); rows: actual class, columns: predicted class
       E          S          U
E      98.21 %    1.77 %     0.02 %
S      3.78 %     96.10 %    0.12 %
U      3.24 %     10.15 %    86.62 %

Classifier                   Accuracy
CART                         96.227 %
C4.5 (0.25 confidence)       96.203 %
C4.5 (0.1 confidence)        96.288 %
Random Forest (10 trees)     96.979 %
Random Forest (50 trees)     97.331 %

Once the significance of the selected parameters was
confirmed, various machine learning algorithms were tested.
In particular, the CART algorithm, the C4.5 algorithm with
confidence values of 0.25 and 0.1 and the Random Forest
algorithm with 10 and 50 trees were considered. For all test
cases, a ten-fold cross validation strategy was used. The
compiled 75,000 sample set (Set 3) was divided into 10
complementary subsets and the learning algorithm was executed
10 times. In each run, one of the ten subsets was used
for testing and the other nine subsets were put together to
form the training set. The presented results are the
computed averages across all ten trials. By this approach, every
sample is part of the test set exactly once.
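The ten-fold procedure can be sketched in Python as follows. The shuffling, the seed and the function names are assumptions made for illustration, and `evaluate` stands for any user-supplied routine that trains on one index list and returns an accuracy on the other.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k complementary folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, k, evaluate):
    """Average the score returned by evaluate(train, test) over the k runs."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(100, 10))
# Placeholder evaluation: report the fraction of samples held out for testing.
held_out = cross_validate(100, 10, lambda train, test: len(test) / 100)
print(len(splits), len(splits[0][0]), len(splits[0][1]), held_out)
```

For the 75,000 samples of Set 3, the same split would give ten test subsets of 7,500 samples each, with the remaining 67,500 used for training in each run.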



6.1 i Band Photometric Parameters 

The resulting confusion matrices when considering the 13 i
band parameters are shown in Figure 6. These tables correlate
the actual morphological classes with those outputted
by the classifier. For instance, in the first row, the
percentages of elliptical galaxies that were classified as
elliptical (E), spiral (S) and unknown (U) are shown. Decision
trees output a single morphology type for every input, therefore
the percentages in each row add up to 100%. This corresponds
to all of the input samples of a particular class. The global
accuracy percentage was then calculated by comparing the
total number of correctly classified samples with the total
number of inputted tests.
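The way the row percentages and the global accuracy are obtained can be shown with a short Python sketch. The labels below are a made-up toy example, not results from this study, and the function name is illustrative.

```python
def confusion_percentages(actual, predicted, classes):
    """Return row-normalised confusion percentages and the global accuracy."""
    counts = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        counts[a][p] += 1

    rows = {}
    for a in classes:
        total = sum(counts[a].values())
        # Each row is expressed as a percentage of the samples of that class.
        rows[a] = {
            p: (100.0 * counts[a][p] / total if total else 0.0) for p in classes
        }

    correct = sum(counts[c][c] for c in classes)
    accuracy = 100.0 * correct / len(actual)
    return rows, accuracy


# Toy labels for illustration only.
actual = ["E", "E", "E", "E", "S", "S", "S", "U"]
predicted = ["E", "E", "E", "S", "S", "S", "E", "U"]
rows, accuracy = confusion_percentages(actual, predicted, ["E", "S", "U"])
for name, row in rows.items():
    print(name, row)
print(accuracy)
```

Because every sample receives exactly one predicted class, each row sums to 100%, while the global accuracy weights the classes by their sample counts.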

In all decision tree algorithms tested, the global accuracy
is always above 96.2% with the highest being 97.33%
achieved by the 50 tree random forest technique. All confusion
matrices have their highest values on the diagonal. This
indicates that the majority of samples were classified
correctly. The random forest algorithm with 50 trees
proved to be the most accurate and correctly classified
98.21% of all ellipticals, 96.10% of all spirals
and 86.62% of all unknown objects. The lower classification
percentage for unknown objects can be due to a number of
factors. First of all, the number of training samples with
unknown morphology might not have been enough for the algorithm
to learn how to identify such samples. Secondly, objects that
mislead humans might actually have very similar properties to
spiral or elliptical galaxies and are ultimately also classified
correctly by the algorithm.

The membership functions derived by the fuzzy inference
system for the DeVaucouleurs fit axis ratio, exponential
fit axis ratio and concentration parameters in the i band are
shown in Figure 7. For such a model, subtractive clustering
was used. The results obtained after testing are presented in
Figure 8. Clearly, the developed model is capable of describing
elliptical and spiral galaxies but struggles to accurately
detect galaxies tagged as having an unknown type.

Figure 7. Fuzzy inference system membership functions

6.2 r Band Photometric Parameters 

The CART, C4.5 and Random Forest decision tree algorithms
were also applied to objects in Set 3 with the input
parameters extracted from the r band. The resulting confusion
matrices are presented in Figure 9. When compared
to the results obtained from the i band data, a gain in the
general accuracies of all algorithms can be noted. Using the
r band information seems to help in distinguishing between
ellipticals and spirals. However, the same cannot be said for
the unclassified class since accuracies for this morphology
class dropped from an overall average of about 81% to
about 75%.

Figure 9. Decision tree confusion matrices for r band input
parameters

Figure 10. Decision tree confusion matrices for i and r bands
input parameters

Figure 11. Decision tree confusion matrices for spectra input
parameters



6.3 i and r Bands Photometric Parameters 



Following the tests described in Section 6.1 and Section 6.2
above, models built from the i and r band parameters were
tested. A 24 attribute dataset was constructed for objects
defined in Set 3 and the same decision tree algorithms were
applied to obtain corresponding classifiers. The results are
presented in Figure 10. Although an improvement in accuracy
is registered by the 50 tree Random Forest algorithm,
this is only by a very small percentage.



6.4 Spectra Parameters 

A similar methodology was adopted to test classification
accuracies of models developed from spectra data. All spectral
line entries for objects in Set 3 were initially downloaded
from the SDSS database. As described in Section 5.2, the
wavelengths for which more than 95% of the data was available
were considered. This allowed for a 24 attribute feature
space and the achieved results are presented in Figure 11.
Although still reasonably accurate, the general classification
capability was found to be less than that obtained when
photometric parameters were used. This could be due to the fact
that peak spectral lines which are significant to detect spiral
galaxies are not always present.

Figure 12. Data, algorithms and results



7 CONCLUSION 

In this study, accuracies for different galaxy morphology
classification models developed through various machine
learning techniques were obtained and analysed. Results
from the CART, the C4.5 and the Random Forest decision
tree algorithms as well as the output from Fuzzy Inference
Systems were compared. The advantages gained by
performing computations on different photometric parameters
and on spectra attributes were also investigated and
put forward. Figure 12 serves as a good summary of which
data and algorithms were used as well as the overall accuracies
obtained. In all cases, the Random Forest gave the highest
percentages, especially when 50 trees were used.

All of the tested algorithms took only a few minutes to
run on a normal personal computer. Although the presented
results are for Set 3, experiments on Set 2, which contains more
samples, were also carried out. Accuracy percentages very
close to the ones published were obtained.
Figure 13. Samples of spiral (top), elliptical (middle) and
unknown (bottom) galaxies that were incorrectly classified by the
fuzzy inference system



In most cases, when processing photometric parameters,
the adaptive shape measure (mRrCc) parameter was chosen
as the root of the tree. First level nodes included the
concentration (petroR90/petroR50) and the dered_g - dered_r
parameters. For spectra data, the Hα line was determined
to provide the highest information gain while the Hβ
and the K lines were chosen as first level nodes.

Figure 13 shows samples of galaxies incorrectly classified
by the fuzzy inference system. Although this is not the
most accurate technique described, the incorrectly classified
spiral and elliptical samples are very faint in magnitude.
Moreover, all incorrectly classified unknown objects have
bright sources in the vicinity and this could have had an
effect on the parameters calculated by the SDSS photometric
pipeline.